BiGill - combined notebook pages

Running on Hummingbird

d-128-95-149-219:bsmap-2.74 sr320$ ./bsmap -w 1000 -a /Volumes/NGS\ Drive/NGS\ Raw\ Data/Oyster_BSseq/filtered_Unlabeled_NoIndex_L003_R1.fastq.gz -d /Volumes/web/cnidarian/TJGR_RepProMask_TE.fa -o /Volumes/web/cnidarian/BiGill_BSMAP_TEonly_v2.sam -p 16

BSMAP v2.74
Start at:FriJun1413:19:062013

Input reference file:/Volumes/web/cnidarian/TJGR_RepProMask_TE.fa  (format: FASTA)
Loadin119787 db seqs, total size 39414569 bp.2 secs passed
total_kmers:43046721
Create seed table.5 secs passed
max number of mismatches: read_length *8%  max gap size:0
kmer cut-off ratio:5e-07
max multi-hits:1000    max Ns:5   seed size:16   index interval:4
quality cutoff:0base quality char:'!'
min fragment size:28    max fragemt size:500
start from read #1  end at read #4294967295
additional alignment: T in reads => C in reference
mapping strand:++,-+
Single-end alignment(16 threads)
Input read file:/Volumes/NGS Drive/NGS RawData/Oyster_BSseq/filtered_Unlabeled_NoIndex_L003_R1.fastq.gz   (format: gzipped FASTQ)
Output file:/Volumes/web/cnidarian/BiGill_BSMAP_TEonly_v2.sam   (format: SAM)

BSMAP complete on Hummingbird
Total number of aligned reads: 2545683 (1.8%) Done. Finished at Sat Jun 15 20:45:07 2013 Total time consumed: 113161 secs

python methratio.py -d /Volumes/web/cnidarian/TJGR_RepProMask_TE.fa -z -g -o /Volumes/web/cnidarian/BiGill_methratio_TEonly_A.txt -s /Volumes/Bay3/Software/BSMAP/bsmap-2.73/samtools /Volumes/web/cnidarian/BiGill_BSMAP_TEonly_v2.sam

Having look at raw output:

Based on methratio - NA methratio always

CI_lower and CI_upper also NA

7.1 methratio.py

python script to extract methylation ratios from BSMAP mapping results. Require python 2.X.

For human genome, methratio.py needs ~26GB memory.

For systems with limited memory, user can set the -c/—chr option to process specified chromosomes only,

and combine results for all chromosomes afterwards.

Usage: python methratio.py [options] BSMAP_MAPPING_FILES

BSMAP_MAPPING_FILES could be one or more output files from BSMAP.

The format will be determined by the filename suffix.

(SAM format for *.sam and *.bam, BSP format for other filenames.)

Options:

-h, —help show this help message and exit

-o FILE, —out=FILE output file name. (required)

-d FILE, —ref=FILE reference genome fasta file. (required)

-c CHR, —chr=CHR process only specified chromosomes. [default: all]

example: —chr=chr1,chr2 (this uses ~4.5GB compared with ~26GB for the whole genome)

-s PATH, —sam-path=PATH

path to samtools. [default: none]

-u, —unique process only unique mappings/pairs.

-p, —pair process only properly paired mappings.

-z, —zero-meth report loci with zero methylation ratios.

-q, —quiet don’t print progress on stderr.

-r, —remove-duplicate

remove duplicated mappings to reduce PCR bias.

(This option should not be used on RRBS data. For WGBS, sometimes

it’s hard to tell if duplicates are caused by PCR due to high seqeuncing depth.)

-t N, —trim-fillin=N

trim N fill-in nucleotides in DNA fragment end-repairing. [default:2]

(This option is only for pair-end mapping. For RRBS, N could be detetmined by the distance between

cuttings sites on forward and reverse strands. For WGBS, N is usually between 0~3.)

-g, —combine-CpG combine CpG methylaion ratio from both strands. [default: False]

-m FOLD, —min-depth=FOLD

report loci with sequencing depth>=FOLD. [default: 1]

-n, —no-header don’t print a header line

-i CT_SNP, —ct-snp=CT_SNP

how to handle CT SNP (“no-action”, “correct”, “skip”),

default: “correct”.

“correct”: correct the methylation ratio according to the C/T SNP information

estimated by the G/A counts on reverse strand, see the output format below for details.

“skip”: do not report loci with C/T SNP detected (i.e. detected A on reverse strand)

“no-action”: do not consider C/T SNP.

Output format: tab delimited txt file with the following columns:

1) chromorome

2) coordinate (1-based)

3) strand

4) sequence context (2nt upstream to 2nt downstream in Watson strand direction)

5) methylation ratio, calculated as #C_counts / #eff_CT_counts

6) number of effective total C+T counts on this locus (#eff_CT_counts)

CT_SNP=”no action”, #eff_CT_counts = #CT_counts

CT_SNP=”correct”, #eff_CT_counts = #CT_counts * (#rev_G_counts / #rev_GA_counts)

7) number of total C counts on this locus (#C_counts)

8) number of total C+T counts on this locuso (#CT_counts)

9) number of total G counts on this locus of reverse strand (#rev_G_counts)

10) number of total G+A counts on this locus of reverse strand (#rev_GA_counts)

11) lower bound of 95% confidence interval of methylation ratio, calculated by Wilson score interval for binomial proportion.

12) upper bound of 95% confidence interval of methylation ratio, calculated by Wilson score interval for binomial proportion.

Example:

python methratio.py —chr=chr1,chr2 —ref=hg19.fa —out=methratio.txt rrbsmap_sample*.sam

python methratio.py -d mm9.fa -o meth.txt -p bsmap_sample1.bsp bsmap_sample2.sam bsmap_sample3.bam

python methratio.py -s /home/tools/samtools -t 1 -d arab.fa -o meth.txt bsmap_sample.sam

Note: For overlapping paired hits, nucleotides in the overlapped part should be counted only once instead of twice.

methratio.py can correctly handle such cases for SAM format output, but for BSP format it will still be counted twice,

because the BSP format does not contain mapping information of the mate.

@1 day ago

robertsmac:bin sr320$ ./ipython notebook

^this is getting methylation

Still need to get new Coordinate system (again) for visualization.

GFF are in with C9 Modified for joining….

Will be modifying CDS gff so that it coordinates are related to genes

Will need to to arithmetic based on strand so that 5' 3' orientation holds true

SELECT cds.*,

mRNA.Column4 as mRNA_start,

mRNA.Column5 as mRNA_end

FROM [sr320@washington.edu].[Snapshot of CDS GFF] cds

LEFT Join [sr320@washington.edu].[oyster_v9_mRNA GFF] mRNA

on cds.Column9 = mRNA.Column9

Order by Column1 Desc

Create new Dataset

----

correct stuff

SELECT *,

Case when Column7 = '+'

then Column4 - mRNA_start + 1

Else mRNA_end - Column4 + 1

END as New_Start,

Case when Column7 = '+'

then Column5 - mRNA_start + 1

Else mRNA_end - Column5 + 1

END as New_End,

Column5 - Column4

as exon_size -- trying to check arithmetic

FROM

[sr320@washington.edu].[CDS GFF with Gene start and stop] cd

Order by Column1 Desc, Column9 --makes it easy to see big genes

SELECT CGI_ID,len(sequence) as CDS_length FROM [sr320@washington.edu].[qDOD_Cgigas_gene_fasta]

--------------------

Now want to get genomic structure of gene..

GFF has Start on Stop and presumably includes introns….

Reconfigured to get ID out

Now lets get the corresponding fasta (again)

missing ID is in GFF.

no idea what is going on

Downloading to desktop and look at in TextWrangler.

Running on Hummingbird

BiGill: Figuring out this Effective CT factor

BiGill - Gene Specific Methylation (Take 3)

BiGill - Redoing Gene Specific Data